How Machine Learning can be beneficial for Textual CBR
Abstract
In this paper, we discuss the benefits and limitations of Machine Learning (ML) for CBR in domains where the cases are text documents. In textual CBR, the bottleneck is often the assignment of indices to new text documents. While ML has the potential to help build large case-bases from a small start-up collection by learning to classify texts under the index-terms, we found in experiments with a real CBR system that the problem is often beyond the power of purely inductive learning algorithms. The index terms in CBR are very complex, and the number of training instances in a typical case base is too small to generalize reliably from the training instances. We believe that adding domain knowledge can help overcome these problems. In the paper, we illustrate a number of ways in which domain knowledge can be used.

CBR over Textual Cases

Case-Based Reasoning has been successfully applied in various domains where the cases are available as text documents. Examples are Legal Reasoning and Argumentation (Aleven 1997; Branting 1991), Ethical Dilemmas (Ashley & McLaren 1994), Medical Applications (Portinale & Torasso 1995; Gierl 1998), Tutoring, Cooking (Hammond 1986), and Helpdesk systems. Up to now, the case bases for systems in these domains had to be constructed by hand. This leads to a bottleneck in creating and scaling up CBR systems, since manual indexing often involves prohibitive costs: candidate cases have to be retrieved, the most useful cases have to be selected, read and understood, and transcribed into the CBR system's representation. Thus, methods for automatically assigning indices, or for improving case representation and reasoning from texts, are needed.

However, there is a severe gap between the knowledge representation required for CBR and the methods one can perform on textual documents, as in Information Retrieval (IR). Computationally, text documents are an unstructured stream of characters, over which only shallow reasoning based on easily observable surface features can be performed. This reasoning can be distinguished from CBR in many material ways. In CBR, cases are represented in a meaningful, very compact way, not as very high-dimensional, hardly interpretable vectors of floating-point numbers over words. They usually contain only important information; noisy or irrelevant information is filtered out in the indexing process. In CBR, comparison of cases can be performed along multiple important dimensions (Ashley 1990). Cases that only match partially can be adapted to a problem situation, using domain knowledge contained in the system (Aleven 1997). Thus, methods which are based only on shallow statistical inferences over word vectors, like Information Retrieval in particular, are not appropriate or sufficient. Instead, mechanisms for mapping textual cases onto a structured representation are required. The sketch below illustrates this representational gap.
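To make the contrast concrete, the following minimal Python sketch shows the same (hypothetical) cases in both views: as compact sets of hand-assigned indices that support comparison along meaningful dimensions, and as the high-dimensional word-count vector that IR-style methods operate on. The factor names and opinion text are invented for illustration and are not drawn from any actual system.

```python
from collections import Counter

# Structured CBR view: a case is a compact set of meaningful indices
# (hypothetical factor names, loosely modeled on CATO-style factors).
case_a = {"Disclosure-In-Negotiations", "Security-Measures"}
case_b = {"Security-Measures", "Agreed-Not-To-Disclose"}

# Comparison along meaningful dimensions: shared factors and distinctions.
print("shared:      ", case_a & case_b)   # what makes the cases similar
print("distinctions:", case_a ^ case_b)   # what sets them apart

# IR view of the same case: a hardly interpretable vector over words.
opinion_a = ("plaintiff disclosed the formula to defendant during "
             "licensing negotiations but kept the plant locked")
print(Counter(opinion_a.split()))  # thousands of dimensions in a real opinion
```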
In the following sections, we will discuss our legal CBR application, and the problems we encountered when using ML methods for deriving the mapping between textual cases and a CBR case representation.

Application: CBR in CATO

Our particular CBR application is CATO, an intelligent learning environment for teaching law students skills of making arguments with cases. The system has been implemented for the domain of trade secrets law, which is concerned with protecting the intellectual property rights of inventors against competitors. In the CATO model, cases are represented in terms of 26 factors. These are prototypical fact situations which tend to strengthen or weaken the plaintiff's claim. Cases are compared in terms of factors, and following Hypo (Ashley 1990), similarity between cases is determined by the inclusiveness of shared sets of factors. High-level knowledge about trade secrets law is represented in a Factor Hierarchy, where the base-level factors are linked to more high-level legal issues and concerns via a specific support mechanism. With the Factor Hierarchy, cases that match only partially can be compared in terms of abstractions. Using this model, and a Case Database of 150 trade secrets law cases, CATO makes arguments using six basic argument moves. The Factor Hierarchy also enables CATO to reason context-dependently about similarities and differences between cases.

The assignment of factors to new cases can be treated as a learning problem. Each of CATO's factors corresponds to a concept, for which there are positive and negative instances in the Case Database, together with the full-text opinion. The learning task can be defined as follows. For each of CATO's factors:
• Given: a set of full-text legal opinions, labeled as to whether the factor applies or not,
• Learn: a classifier which determines, for a previously unseen opinion, whether the factor applies or not.

The classifiers for the individual factors are not equally difficult to learn. For some, we have as few as five positive instances in the Case Database, while others apply in almost half of the cases. Clearly, the factors with few positive instances are harder to induce. The factors also relate to distinct real-world situations of different character. For instance, f4, Non-Disclosure Agreement, is usually described in a more uniform way than f6, Security Measures. F6 captures a much wider variety of situations, and therefore it is easier to derive a classifier for cases related to f4.

In order to bring the texts into a feature/value representation that is computationally accessible to an ML algorithm, we treated the texts as bags-of-words. All punctuation and numbers, including citations of other cases, were filtered out. We removed stop-words, and applied Porter's algorithm for stemming. Terms included single words and adjacent pairs of non-stop-words. For term weighting, we used tfidf, after removing very rare words, i.e., those that occurred in fewer than four documents. We did not consider any structural or semantic information, such as sentence or paragraph breaks. A sketch of this preprocessing pipeline follows.
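The following is a minimal sketch of such a pipeline, under stated assumptions: the stop-word list is abbreviated, NLTK's PorterStemmer stands in for whatever stemmer implementation was used, pairs are formed over adjacent words remaining after stop-word removal, the tfidf weight is computed as raw term frequency times log inverse document frequency, and the hypothetical corpus is a plain list of opinion strings. None of these details are specified in the original system.

```python
import math
import re
from collections import Counter

from nltk.stem import PorterStemmer  # Porter's stemming algorithm

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "that", "is", "was"}  # abbreviated
stemmer = PorterStemmer()

def terms(text):
    """Turn one opinion into stemmed single words plus adjacent non-stop-word pairs."""
    # Keeping only alphabetic runs drops punctuation and numbers,
    # and case citations fall out with them.
    tokens = re.findall(r"[a-z]+", text.lower())
    content = [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]
    pairs = [f"{a}_{b}" for a, b in zip(content, content[1:])]
    return content + pairs

def tfidf_vectors(opinions, min_df=4):
    """Weight terms by tf*idf, dropping terms occurring in fewer than min_df texts."""
    term_lists = [terms(t) for t in opinions]
    df = Counter(term for tl in term_lists for term in set(tl))  # document frequency
    n = len(opinions)
    vectors = []
    for tl in term_lists:
        tf = Counter(tl)
        vectors.append({t: tf[t] * math.log(n / df[t])
                        for t in tf if df[t] >= min_df})
    return vectors
```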
For the experiments, we implemented the following methods (for a description of the algorithms we adapted, see Brüninghaus & Ashley 1997):
• Rocchio (Joachims 1996),
• Winnow and Weighted Majority (Blum 1995),
• Exponentiated Gradient and Widrow-Hoff (Lewis et al. 1996), and
• Centroid Distance (or, in (Lang 1995), TFIDF-Prototype).
We also included in the experiments the Naive Bayes classifier from the Libbow package by Andrew McCallum, a package for text classification available via the World Wide Web from the Text Learning Group at Carnegie Mellon University.

In order to assess performance, we decided to use accuracy as the measure. In our application, we do not have enough cases for the usual interpretations of precision/recall to hold. Also, the absence of a factor can be very relevant in making legal arguments, which makes the recognition of both positive and negative instances of the concepts relevant. Hence, we used accuracy as the main criterion to measure performance. However, for factors with few positive instances, high accuracy can be achieved by always labeling new cases as negative: with only five positive instances among 150 cases, a classifier that always answers "no" is 96.7% accurate, yet has zero recall. Therefore, we also considered precision/recall, to discover this undesired behavior. The cases in the Case Database were assigned to test and training sets in 10-fold, randomized cross-validation runs.

Disappointingly, in the experiments, all algorithms performed quite poorly on our data set (Brüninghaus & Ashley 1997). Only Rocchio and Centroid Distance (both sketched below) achieved acceptable results in terms of both accuracy and precision/recall, and only for the factors where we had a relatively large number of positive training instances. On all other factors, we often observed that the algorithms could not discover positive cases, and always labeled opinion texts as negative. These results did not confirm the hopes raised by results reported previously for experiments on different problems. (For a comprehensive overview and comparison of text classification experiments, see Yang 1997.) We think that these otherwise successful methods did not work well on our problem because:
• The factors are very complex concepts to be learned. A variety of circumstances and situations can influence the presence of a factor, and the relevant information can be spread over multiple passages within the text. Sometimes, the evidence for a factor can be indirect. E.g., it is considered to be improper means (a factor favoring the plaintiff in CATO) if the defendant searches the plaintiff's garbage. The reasoning underlying this factor assignment is very hard to capture in existing text classification systems without any background or semantic knowledge.
• CATO's Case Database of 150 cases is orders of magnitude smaller than collections like the Reuters newswire stories or the Medline data (Yang 1997). Even so, it is fairly large for a CBR system, and a huge amount of time and effort went into building it. It is not realistic to hope for a CBR case-base of a size comparable to the Reuters collection, with thousands of fully indexed cases, for a real application; methods that can generalize reliably from small case-bases have to be found.
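For concreteness, here is a minimal sketch of the two methods that fared best, operating on tfidf vectors stored as numpy arrays. This is not the authors' implementation: the beta/gamma weights are the conventional Rocchio relevance-feedback values, and the decision threshold is an invented placeholder that would in practice be tuned per factor.

```python
import numpy as np

def _unit(v):
    """Scale a vector to unit length, so cosine similarity becomes a dot product."""
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def rocchio_prototype(pos, neg, beta=16.0, gamma=4.0):
    """Rocchio: beta * mean(positives) - gamma * mean(negatives),
    with negative components clipped to zero. pos/neg are 2-D arrays
    of tfidf vectors, one row per training opinion."""
    proto = beta * pos.mean(axis=0) - gamma * neg.mean(axis=0)
    return _unit(np.maximum(proto, 0.0))

def centroid_prototype(pos):
    """Centroid Distance / TFIDF-Prototype: the mean of the positive examples."""
    return _unit(pos.mean(axis=0))

def factor_applies(doc, prototype, threshold=0.2):
    """Label an opinion positive if it lies close enough to the prototype.
    The threshold is hypothetical and would be tuned per factor."""
    return float(_unit(doc) @ prototype) >= threshold
```

In CATO's setting, one such prototype would be learned per factor, with the 10-fold cross-validation runs described above supplying the train/test splits.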